Abstract. Probabilistic model checking is an automated technique to
verify whether a probabilistic system, e.g., a distributed network protocol
which can exhibit failures, satisfies a temporal logic property, for example,
“the minimum probability of the network recovering from a fault in
a given time period is above 0.98”. Dually, we can also synthesise, from a
model and a property specification, a strategy for controlling the system
in order to satisfy or optimise the property, but this aspect has received
less attention to date. In this paper, we give an overview of methods for
automated verification and strategy synthesis for probabilistic systems.
Primarily, we focus on the model of Markov decision processes and use
property specifications based on probabilistic LTL and expected reward
objectives. We also describe how to apply multi-objective model checking
to investigate trade-offs between several properties, and extensions
to stochastic multi-player games. The paper concludes with a summary
of future challenges in this area.


1 Introduction 

Probabilistic model checking is an automated technique for verifying quantitative
properties of stochastic systems. Like conventional model checking, it uses a
systematic exploration and analysis of a system model to verify that certain
requirements, specified in temporal logic, are satisfied by the model. In probabilistic
model checking, models incorporate information about the likelihood and/or
timing of the system's evolution, to represent uncertainty arising from, for example,
component failures, unreliable sensors, or randomisation. Commonly used
models include Markov chains and Markov decision processes.

Properties to be verified against these models are specified in probabilistic
temporal logics such as PCTL, CSL and probabilistic LTL. These capture a variety
of quantitative correctness, reliability or performance properties, for example,
“the maximum probability of the airbag failing to deploy within 0.02 seconds is
at most $10^{-6}$”. Tool support, in the form of probabilistic model checkers such as
PRISM [23] and MRMC [27], has been used to verify quantitative properties of
a wide variety of real-life systems, from wireless communication protocols,
to aerospace designs [5], to DNA circuits [52].

One of the key strengths of probabilistic model checking, in contrast to, for
example, approximate analysis techniques based on Monte Carlo simulation, is
the ability to analyse quantitative properties in an exhaustive manner. A prime
example of this is when analysing models that incorporate both probabilistic
and nondeterministic behaviour, such as Markov decision processes (MDPs). In
an MDP, certain unknown aspects of a system's behaviour, e.g., the scheduling
between components executing in parallel, or the instructions issued by a
controller at runtime, are modelled as nondeterministic choices. Each possible way
of resolving these choices is referred to as a strategy. To verify a property $\phi$ on
an MDP $\mathcal{M}$, we check that $\phi$ holds for all possible strategies of $\mathcal{M}$.

Alternatively, we can consider the dual problem of strategy synthesis, which
finds some strategy of $\mathcal{M}$ that satisfies a property $\phi$, or which optimises a
specified objective. This is more in line with the way that MDPs are used in other
fields, such as planning under uncertainty [34], reinforcement learning [42] or
optimal control [5]. In the context of probabilistic model checking, the strategy
synthesis problem has generally received less attention than the dual problem of
verification, despite being solved in essentially the same way. Strategy synthesis,
however, has many uses; examples of its application to date include:

(i) Robotics. In recent years, temporal logics such as LTL have grown increasingly
popular as a means to specify tasks when synthesising controllers for
robots or embedded systems. In the presence of uncertainty, e.g. due to
unreliable sensors or actuators, optimal controller synthesis can be performed
using MDP model checking techniques [51].

(ii) Security. In the context of computer security, model checking has been used
to synthesise strategies for malicious attackers, which represent flaws in
security systems or protocols. Probability is also often a key ingredient of
security; for example, in [41], probabilistic model checking of MDPs was
used to generate PIN guessing attacks against hardware security modules.

(iii) Dynamic power management. The problem of synthesising optimal (randomised)
control strategies to switch between power states in electronic
devices can be solved using optimisation problems on MDPs [7] or,
alternatively, with multi-objective strategy synthesis for MDPs.

In application domains such as these, probabilistic model checking offers various
benefits. Firstly, as mentioned above, temporal logics provide an expressive
means of formally specifying the goals of, for example, a controller or a malicious
attacker. Secondly, thanks to formal specifications for models and properties,
rigorous mathematical underpinnings, and the use of exact, exhaustive solution
methods, strategy synthesis yields controllers that are guaranteed to be correct
(at least with respect to the specified model and property). Such guarantees may
be essential in the context of safety-critical systems.

Lastly, advantage can be drawn from the significant body of both past and
ongoing work to improve the efficiency and scalability of probabilistic verification
and strategy synthesis. This includes methods developed specifically for
probabilistic model checking, such as symbolic techniques, abstraction or symmetry
reduction, and also advances from other areas of computer science. For
example, renewed interest in the area of synthesis for reactive systems has led to
significantly improved methods for generating the automata needed to synthesise
strategies for temporal logics such as LTL. Parallels can also be drawn with
verification techniques for timed systems: for example, UPPAAL [53], a model
checker developed for verifying timed automata, has been used to great success
for synthesising solutions to real-time task scheduling problems, and is in many
cases superior to alternative state-of-the-art methods.

In this paper, we give an overview of methods for performing verification and
strategy synthesis on probabilistic systems. Our focus is primarily on algorithmic
issues: we introduce the basic ideas, illustrate them with examples and then
summarise the techniques required to perform them. For space reasons, we restrict
our attention to finite-state models with a discrete notion of time. We also
only consider complete information scenarios, i.e. where the state of the model
is fully observable to the strategy that is controlling it.

Primarily, we describe techniques for Markov decision processes. The first
two sections provide some background material, introduce the strategy synthesis
problem and summarise methods to solve it. In subsequent sections, we describe
extensions to multi-objective verification and stochastic multi-player games. We
conclude the paper with a discussion of some of the important topics of ongoing
and future research in this area.

An extended version of this paper, which includes full details of the algorithms
needed to perform strategy synthesis and additional worked examples, is available
as [50]. The examples in both versions of the paper can be run using PRISM
(and its extensions [15]). Accompanying PRISM files are available online.

2 Markov Decision Processes 

In the majority of this paper, we focus on Markov decision processes (MDPs),
which model systems that exhibit both probabilistic and nondeterministic behaviour.
Probability can be used to model uncertainty from a variety of sources,
e.g., the unreliable behaviour of an actuator, the failure of a system component
or the use of randomisation to break symmetry.

Nondeterminism, on the other hand, models unknown behaviour. Again, this
has many uses, depending on the context. When using an MDP to model and
verify a randomised distributed algorithm or network protocol, nondeterminism
might represent concurrency between multiple components operating in parallel,
or underspecification, where some parameter or behaviour of the system is only
partially defined. In this paper, where we focus mainly on the problem of strategy
synthesis for MDPs, nondeterminism is more likely to represent the possible
decisions that can be taken by a controller of the system.

Formally, we define an MDP as follows. 

Definition 1 (Markov decision process). A Markov decision process (MDP)
is a tuple $\mathcal{M} = (S, \bar{s}, A, \delta_{\mathcal{M}}, Lab)$ where $S$ is a finite set of states, $\bar{s} \in S$ is an
initial state, $A$ is a finite set of actions, $\delta_{\mathcal{M}} : S \times A \to Dist(S)$ is a (partial)
probabilistic transition function, mapping state-action pairs to probability distributions
over $S$, and $Lab : S \to 2^{AP}$ is a labelling function assigning to each state
a set of atomic propositions taken from a set $AP$.

An MDP models how the state of a system can evolve, starting from an initial
state $\bar{s}$. In each state $s$, there is a choice between a set of enabled actions $A(s) \subseteq A$,
where $A(s) \stackrel{\mathrm{def}}{=} \{a \in A \mid \delta_{\mathcal{M}}(s, a) \text{ is defined}\}$. The choice of an action $a \in A(s)$ is
assumed to be nondeterministic. Once selected, a transition to a successor state
$s'$ occurs randomly, according to the probability distribution $\delta_{\mathcal{M}}(s, a)$, i.e., the
probability that a transition to $s'$ occurs is $\delta_{\mathcal{M}}(s, a)(s')$.
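
As an illustration of Definition 1, the following Python sketch stores the partial transition function as a dictionary keyed by state-action pairs and derives the enabled actions $A(s)$ from it. The three-state MDP and the names `delta`, `labels`, `enabled` and `is_well_formed` are hypothetical and chosen only for this example; it is not the robot model of Fig. 1.

```python
from fractions import Fraction

# Hypothetical 3-state MDP (an assumption for illustration only).
# delta maps (state, action) to a distribution over successor states;
# only successors with non-zero probability are stored.
delta = {
    ("s0", "east"): {"s1": Fraction(1)},
    ("s0", "south"): {"s1": Fraction(1, 2), "s2": Fraction(1, 2)},
    ("s1", "stay"): {"s1": Fraction(1)},
    ("s2", "stay"): {"s2": Fraction(1)},
}
# Lab: each state is mapped to its set of atomic propositions.
labels = {"s0": set(), "s1": {"goal"}, "s2": {"hazard"}}


def enabled(delta, s):
    """A(s): the actions a for which delta(s, a) is defined."""
    return sorted(a for (t, a) in delta if t == s)


def is_well_formed(delta):
    """Check that every defined delta(s, a) is a probability distribution."""
    return all(
        sum(dist.values()) == 1 and all(p > 0 for p in dist.values())
        for dist in delta.values()
    )


print(enabled(delta, "s0"))
print(is_well_formed(delta))
```

Exact rationals are used here so that the check that each distribution sums to 1 is not affected by floating-point rounding.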

A path is a (finite or infinite) sequence of transitions $\pi = s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} \cdots$
through MDP $\mathcal{M}$, i.e., where $s_i \in S$, $a_i \in A(s_i)$ and $\delta_{\mathcal{M}}(s_i, a_i)(s_{i+1}) > 0$ for all
$i \in \mathbb{N}$. The $(i+1)$th state $s_i$ of path $\pi$ is denoted $\pi(i)$ and, if $\pi$ is finite, $last(\pi)$
denotes its final state. We write $FPath_{\mathcal{M},s}$ and $IPath_{\mathcal{M},s}$, respectively, for the set
of all finite and infinite paths of $\mathcal{M}$ starting in state $s$, and denote by $FPath_{\mathcal{M}}$
and $IPath_{\mathcal{M}}$ the sets of all such paths.

Example 1. Fig. 1 shows an MDP $\mathcal{M}$, which we will use as a running example.
It represents a robot moving through terrain that is divided up into a 3 x 2
grid, with each grid section represented as one state. In each of the 6 states, one
or more actions from the set $A = \{north, east, south, west, stuck\}$ are available,
which move the robot between grid sections. Due to the presence of obstacles,
certain actions are unavailable in some states or probabilistically move the robot
to an alternative state. Action stuck, in states $s_2$ and $s_3$, indicates that the robot
is unable to move. In Fig. 1, the probabilistic transition function is drawn as
grouped, labelled arrows; where the probability is 1, it is omitted. We also show
labels for the states, taken from the set $AP = \{hazard, goal_1, goal_2\}$.

Rewards and costs. We use rewards as a general way of modelling various
additional quantitative measures of an MDP. Although the name “reward” suggests
a quantity that it is desirable to maximise (e.g., profit), we will often use
the same mechanism for costs, which would typically be minimised (e.g. energy
consumption). In this paper, rewards are values attached to the actions available
in each state, and we assume that these rewards are accumulated over time.

Definition 2 (Reward structure). A reward structure for an MDP
$\mathcal{M} = (S, \bar{s}, A, \delta_{\mathcal{M}}, Lab)$ is a function of the form $r : S \times A \to \mathbb{R}_{\geq 0}$.

Strategies. We reason about the behaviour of MDPs using strategies (which,
depending on the context, are also known as policies, adversaries or schedulers).
A strategy resolves nondeterminism in an MDP, i.e., it chooses which action (or
actions) to take in each state. In general, this choice can depend on the history
of the MDP's execution so far and can be randomised.

Definition 3 (Strategy). A strategy of an MDP $\mathcal{M} = (S, \bar{s}, A, \delta_{\mathcal{M}}, Lab)$ is a
function $\sigma : FPath_{\mathcal{M}} \to Dist(A)$ such that $\sigma(\pi)(a) > 0$ only if $a \in A(last(\pi))$.

Fig. 1. Running example: an MDP $\mathcal{M}$ representing a robot moving about a 3 x 2 grid.

We denote by $\Sigma_{\mathcal{M}}$ the set of all strategies of $\mathcal{M}$, but in many cases we can restrict
our attention to certain subclasses. In particular, we can classify strategies in
terms of their use of randomisation and memory.

1. randomisation: we say that strategy $\sigma$ is deterministic (or pure) if $\sigma(\pi)$ is
a point distribution for all $\pi \in FPath_{\mathcal{M}}$, and randomised otherwise;

2. memory: a strategy $\sigma$ is memoryless if $\sigma(\pi)$ depends only on $last(\pi)$ and
finite-memory if there are finitely many modes such that $\sigma(\pi)$ depends only
on $last(\pi)$ and the current mode, which is updated each time an action is
performed; otherwise, it is infinite-memory.

Under a particular strategy $\sigma$ of $\mathcal{M}$, all nondeterminism is resolved and the
behaviour of $\mathcal{M}$ is fully probabilistic. Formally, we can represent this using an
(infinite) induced discrete-time Markov chain, whose states are finite paths of
$\mathcal{M}$. This leads us, using a standard construction [28], to the definition of a
probability measure $Pr^{\sigma}_{\mathcal{M},s}$ over infinite paths $IPath_{\mathcal{M},s}$, capturing the behaviour
of $\mathcal{M}$ from state $s$ under strategy $\sigma$. We will also use, for a random variable
$X : IPath_{\mathcal{M},s} \to \mathbb{R}_{\geq 0}$, the expected value of $X$ from state $s$ in $\mathcal{M}$ under strategy
$\sigma$, denoted $E^{\sigma}_{\mathcal{M},s}(X)$. If $s$ is the initial state $\bar{s}$, we omit it and write $Pr^{\sigma}_{\mathcal{M}}$ or $E^{\sigma}_{\mathcal{M}}$.
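
For a memoryless deterministic strategy, the induced Markov chain can be represented over the states of the MDP itself rather than over finite paths. The Python sketch below makes that simplifying assumption; the MDP, the strategy and the function names `induced_chain` and `path_probability` are hypothetical and serve only to illustrate how fixing a strategy removes all nondeterminism.

```python
# Hypothetical MDP: delta maps (state, action) to a successor distribution.
delta = {
    ("s0", "east"): {"s1": 1.0},
    ("s0", "south"): {"s1": 0.5, "s2": 0.5},
    ("s1", "stay"): {"s1": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}


def induced_chain(delta, strategy):
    """Markov chain (state -> successor distribution) induced by a
    memoryless deterministic strategy given as a dict state -> action."""
    chain = {}
    for s, a in strategy.items():
        if (s, a) not in delta:
            raise ValueError(f"action {a!r} is not enabled in state {s!r}")
        chain[s] = dict(delta[(s, a)])
    return chain


def path_probability(chain, path):
    """Probability of following the finite state sequence `path`."""
    prob = 1.0
    for s, t in zip(path, path[1:]):
        prob *= chain[s].get(t, 0.0)
    return prob


strategy = {"s0": "south", "s1": "stay", "s2": "stay"}
chain = induced_chain(delta, strategy)
print(path_probability(chain, ["s0", "s1", "s1"]))
```

Strategies with memory or randomisation would need a richer representation, for example a chain over pairs of state and memory mode.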

3 Strategy Synthesis for MDPs 

We now explain the strategy synthesis problem for Markov decision processes
and give a brief overview of the algorithms that can be used to solve it. From
now on, unless stated otherwise, we assume a fixed MDP $\mathcal{M} = (S, \bar{s}, A, \delta_{\mathcal{M}}, Lab)$.


3.1 Property Specification 

First, we need a way to formally specify a property of the MDP that we wish
to hold under the strategy to be synthesised. We follow the approach usually
adopted in probabilistic verification and specify properties using temporal logic.
More precisely, we will use a fragment of the property specification language
from the PRISM model checker [25], the full version of which subsumes logics
such as PCTL, probabilistic LTL, PCTL* and others.
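For any MDP $\mathcal{M}$, state $s \in S$ and strategy $\sigma$:

\[
\begin{array}{lcl}
\mathcal{M}, s, \sigma \models \mathtt{P}_{\bowtie p}[\psi] & \Longleftrightarrow & Pr^{\sigma}_{\mathcal{M},s}(\{\pi \in IPath_{\mathcal{M},s} \mid \pi \models \psi\}) \bowtie p \\
\mathcal{M}, s, \sigma \models \mathtt{R}^{r}_{\bowtie x}[\rho] & \Longleftrightarrow & E^{\sigma}_{\mathcal{M},s}(rew(r, \rho)) \bowtie x
\end{array}
\]

For any path $\pi = s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \ldots \in IPath_{\mathcal{M}}$:

\[
\begin{array}{lcl}
\pi \models \mathtt{true} & & \text{always} \\
\pi \models b & \Longleftrightarrow & b \in Lab(s_0) \\
\pi \models \psi_1 \wedge \psi_2 & \Longleftrightarrow & \pi \models \psi_1 \wedge \pi \models \psi_2 \\
\pi \models \neg\psi & \Longleftrightarrow & \pi \not\models \psi \\
\pi \models \mathtt{X}\,\psi & \Longleftrightarrow & s_1 s_2 \ldots \models \psi \\
\pi \models \psi_1\, \mathtt{U}^{\leq k}\, \psi_2 & \Longleftrightarrow & \exists i \leq k \,.\, (s_i s_{i+1} \ldots \models \psi_2 \wedge (\forall j < i \,.\, s_j s_{j+1} \ldots \models \psi_1)) \\
\pi \models \psi_1\, \mathtt{U}\, \psi_2 & \Longleftrightarrow & \exists i \geq 0 \,.\, (s_i s_{i+1} \ldots \models \psi_2 \wedge (\forall j < i \,.\, s_j s_{j+1} \ldots \models \psi_1))
\end{array}
\]

For any reward structure $r$ and path $\pi = s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \ldots \in IPath_{\mathcal{M}}$:

\[
\begin{array}{rcl}
rew(r, \mathtt{C}^{\leq k})(\pi) & \stackrel{\mathrm{def}}{=} & \sum_{j=0}^{k-1} r(s_j, a_j) \\
rew(r, \mathtt{C})(\pi) & \stackrel{\mathrm{def}}{=} & \sum_{j=0}^{\infty} r(s_j, a_j) \\
rew(r, \mathtt{F}\, b)(\pi) & \stackrel{\mathrm{def}}{=} &
\begin{cases}
\infty & \text{if } \forall j \in \mathbb{N} : b \notin Lab(s_j), \\
\sum_{j=0}^{k-1} r(s_j, a_j) & \text{otherwise, where } k = \min\{j \mid b \in Lab(s_j)\}
\end{cases}
\end{array}
\]

Fig. 2. Inductive definition of the property satisfaction relation $\models$.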


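Definition 4 (Properties and objectives). For the purposes of this paper, a
property is a formula $\phi$ derived from the following grammar:

\[
\begin{array}{rcl}
\phi & ::= & \mathtt{P}_{\bowtie p}[\psi] \;\mid\; \mathtt{R}^{r}_{\bowtie x}[\rho] \\
\psi & ::= & \mathtt{true} \;\mid\; b \;\mid\; \psi \wedge \psi \;\mid\; \neg\psi \;\mid\; \mathtt{X}\,\psi \;\mid\; \psi\, \mathtt{U}^{\leq k}\, \psi \;\mid\; \psi\, \mathtt{U}\, \psi \\
\rho & ::= & \mathtt{C}^{\leq k} \;\mid\; \mathtt{C} \;\mid\; \mathtt{F}\, b
\end{array}
\]

where $b \in AP$ is an atomic proposition, $\bowtie\, \in \{\leq, <, \geq, >\}$, $p \in [0,1]$, $r$ is a
reward structure, $x \in \mathbb{R}_{\geq 0}$ and $k \in \mathbb{N}$. We refer to $\psi$ and $\rho$ as objectives.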

A property is thus a single instance of either the $\mathtt{P}_{\bowtie p}[\psi]$ operator, which asserts
that the probability of a path satisfying (LTL) formula $\psi$ meets the bound
$\bowtie p$, or the $\mathtt{R}^{r}_{\bowtie x}[\rho]$ operator, which asserts that the expected value of a reward
objective $\rho$, using reward structure $r$, satisfies $\bowtie x$. For now, we forbid multiple
occurrences of the P or R operators in the same property,$^{1}$ as would typically
be permitted when using branching-time probabilistic logics such as PCTL or
PCTL* for verification of MDPs. This is because our primary focus in this
tutorial is not verification, but strategy synthesis, for which the treatment of
branching-time logics is more challenging.

For an MDP $\mathcal{M}$, state $s$ and strategy $\sigma$ of $\mathcal{M}$, and property $\phi$, we write
$\mathcal{M}, s, \sigma \models \phi$ to denote that, when starting from $s$, and operating under $\sigma$, $\mathcal{M}$
satisfies $\phi$. Generally, we are interested in the behaviour of $\mathcal{M}$ from its initial
state $\bar{s}$, and we write $\mathcal{M}, \sigma \models \phi$ to denote that $\mathcal{M}, \bar{s}, \sigma \models \phi$. A formal definition
of the satisfaction relation $\models$ is given in Fig. 2. Below, we identify several key
classes of properties and explain them in more detail.


$^{1}$ We will relax this restriction, for multi-objective strategy synthesis, in Sec. 4.
Probabilistic reachability. For probabilistic properties $\mathtt{P}_{\bowtie p}[\psi]$, a simple but
fundamental class of path formulae $\psi$ are “until” formulae of the form $b_1\, \mathtt{U}\, b_2$,
where $b_1, b_2$ are atomic propositions. Intuitively, $b_1\, \mathtt{U}\, b_2$ is true if a $b_2$-labelled
state is eventually reached, whilst passing only through $b_1$ states. Particularly
useful is the derived operator $\mathtt{F}\, b \equiv \mathtt{true}\, \mathtt{U}\, b$, representing reachability, i.e., a $b$
state is eventually reached. Another common derived operator is $\mathtt{G}\, b \equiv \neg \mathtt{F}\, \neg b$,
which captures invariance, i.e., that $b$ always remains true. Also useful are
step-bounded variants. For example, step-bounded reachability, expressed as
$\mathtt{F}^{\leq k}\, b \equiv \mathtt{true}\, \mathtt{U}^{\leq k}\, b$, means that a $b$-state is reached within $k$ steps.

Probabilistic LTL. More generally, for the probabilistic properties $\mathtt{P}_{\bowtie p}[\psi]$
defined in Defn. 4, $\psi$ can be any formula in the temporal logic LTL. This allows
a wide variety of useful properties to be expressed. These include, for example:
(i) $\mathtt{G}\,\mathtt{F}\, b$ (infinitely often $b$); (ii) $\mathtt{F}\,\mathtt{G}\, b$ (eventually always $b$); (iii) $\mathtt{G}\,(b_1 \rightarrow \mathtt{X}\, b_2)$
($b_2$ always immediately follows $b_1$); and (iv) $\mathtt{G}\,(b_1 \rightarrow \mathtt{F}\, b_2)$ ($b_2$ always eventually
follows $b_1$). Notice that, in order to provide convenient syntax for expressing
step-bounded reachability (discussed above), we explicitly add a step-bounded
until operator $\mathtt{U}^{\leq k}$. This is not normally included in the syntax of LTL, but does
not add to its expressivity (e.g., $b_1\, \mathtt{U}^{\leq 2}\, b_2 \equiv b_2 \vee (b_1 \wedge \mathtt{X}\, b_2) \vee (b_1 \wedge \mathtt{X}\, b_1 \wedge \mathtt{X}\,\mathtt{X}\, b_2)$).
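
The observation that the step-bounded until adds no expressive power can be made explicit by unrolling it. The following recursion is a standard identity rather than something stated in this form in the text; it is given here as a sketch for general operands $\psi_1, \psi_2$:

```latex
\begin{align*}
\psi_1\, \mathtt{U}^{\leq 0}\, \psi_2 &\equiv \psi_2 \\
\psi_1\, \mathtt{U}^{\leq k}\, \psi_2 &\equiv \psi_2 \vee \bigl(\psi_1 \wedge \mathtt{X}\,(\psi_1\, \mathtt{U}^{\leq k-1}\, \psi_2)\bigr) \qquad (k \geq 1)
\end{align*}
```

Unrolling twice for $k = 2$ and distributing $\mathtt{X}$ over $\vee$ and $\wedge$ gives the three-disjunct formula for $b_1\, \mathtt{U}^{\leq 2}\, b_2$ shown in the preceding paragraph.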

Reward properties. As explained in Sec. 2, rewards (or dually, costs) are
values assigned to state-action pairs that we assume to be accumulated over
time. Properties of the form $\mathtt{R}^{r}_{\bowtie x}[\rho]$ refer to the expected accumulated value of
a reward structure $r$. The period of time over which rewards are accumulated is
specified by the operator $\rho$: for the first $k$ steps ($\mathtt{C}^{\leq k}$), indefinitely ($\mathtt{C}$), or until
a state labelled with $b$ is reached ($\mathtt{F}\, b$). In the final case, if a $b$-state is never
reached, we assume that the accumulated reward is infinite.
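
To make the reward objectives concrete, the Python sketch below evaluates $rew(r, \mathtt{C}^{\leq k})$ and $rew(r, \mathtt{F}\, b)$ on a finite prefix of a path. It is a minimal illustration under stated assumptions: the prefix, the reward values, the labelling and the function names are all hypothetical, and since a finite prefix cannot witness that $b$ is never reached, that case is reported as `None` rather than infinity.

```python
def rew_bounded(r, prefix, k):
    """rew(r, C<=k): sum of the rewards of the first k steps.
    `prefix` is a list [(s_0, a_0), (s_1, a_1), ...] of length >= k."""
    if len(prefix) < k:
        raise ValueError("prefix too short to determine the reward")
    return sum(r[(s, a)] for s, a in prefix[:k])


def rew_reach(r, labels, prefix, b):
    """rew(r, F b): reward accumulated before the first b-labelled state.
    Returns None if no b-state occurs in the prefix (value undetermined;
    it is infinite only if b is never reached on the full path)."""
    total = 0
    for s, a in prefix:
        if b in labels[s]:
            return total
        total += r[(s, a)]
    return None


# Hypothetical reward structure, labelling and path prefix.
r = {("s0", "south"): 2, ("s1", "east"): 3, ("s2", "stay"): 0}
labels = {"s0": set(), "s1": set(), "s2": {"goal"}}
prefix = [("s0", "south"), ("s1", "east"), ("s2", "stay")]

print(rew_bounded(r, prefix, 1))
print(rew_reach(r, labels, prefix, "goal"))
```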

3.2 Verification and Strategy Synthesis 

Classically, probabilistic model checking is phrased in terms of verifying that a
model $\mathcal{M}$ satisfies a property $\phi$. For an MDP, this means checking that $\phi$ holds
for all possible strategies of $\mathcal{M}$.

Definition 5 (Verification). The verification problem is: given an MDP $\mathcal{M}$
and property $\phi$, does $\mathcal{M}, \sigma \models \phi$ hold for all possible strategies $\sigma \in \Sigma_{\mathcal{M}}$?

In practice, this is closely related to the dual problem of strategy synthesis.

Definition 6 (Strategy synthesis). The strategy synthesis problem is: given
MDP $\mathcal{M}$ and property $\phi$, find, if it exists, a strategy $\sigma \in \Sigma_{\mathcal{M}}$ such that $\mathcal{M}, \sigma \models \phi$.

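Verification and strategy synthesis for a property $\phi$ on MDP $\mathcal{M}$ can be done in
essentially the same way, by computing optimal values for either probability or
expected reward objectives, defined as follows:

\[
\begin{array}{ll}
Pr^{min}_{\mathcal{M},s}(\psi) = \inf_{\sigma \in \Sigma_{\mathcal{M}}} Pr^{\sigma}_{\mathcal{M},s}(\psi) &
E^{min}_{\mathcal{M},s}(rew(r, \rho)) = \inf_{\sigma \in \Sigma_{\mathcal{M}}} E^{\sigma}_{\mathcal{M},s}(rew(r, \rho)) \\
Pr^{max}_{\mathcal{M},s}(\psi) = \sup_{\sigma \in \Sigma_{\mathcal{M}}} Pr^{\sigma}_{\mathcal{M},s}(\psi) &
E^{max}_{\mathcal{M},s}(rew(r, \rho)) = \sup_{\sigma \in \Sigma_{\mathcal{M}}} E^{\sigma}_{\mathcal{M},s}(rew(r, \rho))
\end{array}
\]

When $s$ is the initial state $\bar{s}$, we omit the subscript $s$.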

Verifying, for example, property $\phi = \mathtt{P}_{\geq p}[\psi]$ against $\mathcal{M}$ or, dually, synthesising
a strategy for $\phi' = \mathtt{P}_{< p}[\psi]$ can both be done by computing $Pr^{min}_{\mathcal{M}}(\psi)$. For
the former, $\mathcal{M}$ satisfies $\phi$ if and only if $Pr^{min}_{\mathcal{M}}(\psi) \geq p$. For the latter, there exists
a strategy $\sigma$ satisfying $\phi'$ if and only if $Pr^{min}_{\mathcal{M}}(\psi) < p$, in which case we can
take $\sigma$ to be a corresponding optimal strategy, i.e., one that achieves the optimal
value. In general, therefore, rather than fix a specific bound $p$, we often simply
aim to compute an optimal value and accompanying optimal strategy. In this
case, we adapt the syntax of properties to include numerical queries.

Definition 7 (Numerical query). Let $\psi$, $r$ and $\rho$ be as specified in Defn. 4. A
numerical query takes the form $\mathtt{P}_{min=?}[\psi]$, $\mathtt{P}_{max=?}[\psi]$, $\mathtt{R}^{r}_{min=?}[\rho]$ or $\mathtt{R}^{r}_{max=?}[\rho]$
and yields the optimal value for the probability/reward objective.

In the rest of this section, we describe how to compute optimal values and
strategies for the classes of properties described above. We also explain which
class of strategies suffices for optimality in each case (i.e., the smallest class of
strategies which is guaranteed to contain an optimal one). This is important both
in terms of the tractability of the solution methods, and the size and complexity
of the controller that we might wish to construct from the synthesised strategy.
As mentioned earlier, an extended version of this paper, available from [49],
presents full details of these methods. Coverage of this material can also be
found in standard texts on MDPs.

3.3 Strategy Synthesis for Probabilistic Reachability 

To synthesise optimal strategies for probabilistic reachability, it suffices to consider
memoryless deterministic strategies. For this class of properties, and for
those covered in the following subsections, the bulk of the work for strategy
synthesis actually amounts to computing optimal values. An optimal strategy is
extracted either after or during this computation.

Calculating optimal values proceeds in two phases: the first precomputation
phase performs an analysis of the underlying graph structure of the MDP to
identify states for which the probability is 0 or 1; the second performs numerical
computation to determine values for the remaining states. The latter can be
done using various methods: (i) by solution of a linear programming problem;
(ii) policy iteration, which builds a sequence of strategies (i.e., policies) with
increasingly high probabilities until an optimal one is reached; (iii) value iteration,
which computes increasingly precise approximations to the exact probabilities.
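
The value iteration option can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm as implemented in any particular tool: it omits the graph-based precomputation phase, uses a simple difference-based stopping test (which does not by itself bound the error), and runs on a hypothetical four-state MDP; the names `reach_values`, `delta` and `target` are assumptions made for the example.

```python
def reach_values(states, delta, target, maximise=True, eps=1e-10):
    """Approximate Pr^max (or Pr^min) of reaching `target` from each state
    by value iteration, starting from 0 for all non-target states.
    Assumes every state has at least one enabled action."""
    opt = max if maximise else min
    x = {s: 1.0 if s in target else 0.0 for s in states}
    while True:
        new = {}
        for s in states:
            if s in target:
                new[s] = 1.0
                continue
            # One value per enabled action: expected value of the successor.
            vals = [
                sum(p * x[t] for t, p in dist.items())
                for (u, a), dist in delta.items() if u == s
            ]
            new[s] = opt(vals)
        if max(abs(new[s] - x[s]) for s in states) < eps:
            return new
        x = new


# Hypothetical MDP: from s0, "risky" reaches the target s3 half the time,
# while "safe" moves to s1, which retries until it succeeds.
states = ["s0", "s1", "s2", "s3"]
delta = {
    ("s0", "risky"): {"s3": 0.5, "s2": 0.5},
    ("s0", "safe"): {"s1": 1.0},
    ("s1", "go"): {"s3": 0.8, "s0": 0.2},
    ("s2", "stay"): {"s2": 1.0},
    ("s3", "stay"): {"s3": 1.0},
}
vmax = reach_values(states, delta, {"s3"}, maximise=True)
vmin = reach_values(states, delta, {"s3"}, maximise=False)
print(vmax["s0"], vmin["s0"])
```

In this example the maximum probability from `s0` converges towards 1 (via `safe`), while the minimum is 0.5 (via `risky`).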

The method used to construct an optimal strategy $\sigma^*$ depends on how the
probabilities were computed. Policy iteration is the simplest case, since a strategy
is constructed as part of the algorithm. For the others, minimum probabilities
are straightforward: we choose the locally optimal action in each state:

\[
\sigma^*(s) = \arg\min_{a \in A(s)} \sum_{s' \in S} \delta_{\mathcal{M}}(s, a)(s') \cdot Pr^{min}_{\mathcal{M},s'}(\mathtt{F}\, b)
\]

Maximum probabilities require more care, but simple adaptations to precomputation
and value iteration algorithms yield an optimal strategy.
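
The locally optimal choice for minimum probabilities can be sketched as follows. The Python fragment assumes that the minimum reachability probabilities have already been computed and are supplied as a dictionary; the MDP, the numbers in `values` and the function name `min_strategy` are hypothetical.

```python
def min_strategy(delta, values, target):
    """For each non-target state, pick an action minimising the expected
    value of the successor state under the given minimum probabilities."""
    best = {}
    for (s, a), dist in sorted(delta.items()):
        if s in target:
            continue
        score = sum(p * values[t] for t, p in dist.items())
        if s not in best or score < best[s][1]:
            best[s] = (a, score)
    return {s: a for s, (a, _) in best.items()}


# Hypothetical MDP with target {s3} and its (precomputed) minimum
# probabilities of reaching s3.
delta = {
    ("s0", "risky"): {"s3": 0.5, "s2": 0.5},
    ("s0", "safe"): {"s1": 1.0},
    ("s1", "go"): {"s3": 0.8, "s0": 0.2},
    ("s2", "stay"): {"s2": 1.0},
    ("s3", "stay"): {"s3": 1.0},
}
values = {"s0": 0.5, "s1": 0.9, "s2": 0.0, "s3": 1.0}
print(min_strategy(delta, values, {"s3"}))
```

As the text notes, the same one-step rule is not sufficient for maximum probabilities, so this sketch should not be reused with `max` in place of `min` without further care.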
For step-bounded reachability, memoryless strategies do not suffice: we need
to consider the class of finite-memory deterministic strategies. Computation of
optimal probabilities (and an optimal strategy) for step-bounded reachability
amounts to working backwards through the MDP and determining, at each step,
and for each state, which action yields optimal probabilities. In fact, this amounts
simply to performing a fixed number of steps of value iteration.
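
The backwards computation can be sketched as $k$ rounds of value iteration that also record, for each state and each number of remaining steps, which action was optimal. The Python code below is an illustration on a hypothetical MDP (not the robot example); the function name `bounded_reach` and the representation of the finite-memory strategy as a dictionary keyed by `(steps remaining, state)` are assumptions made for this example.

```python
def bounded_reach(states, delta, target, k):
    """Maximum probability of reaching `target` within k steps, together
    with an optimal action for each (steps remaining, state) pair."""
    x = {s: 1.0 if s in target else 0.0 for s in states}
    decisions = {}
    for remaining in range(1, k + 1):
        new = {}
        for s in states:
            if s in target:
                new[s] = 1.0
                continue
            best_a, best_v = None, -1.0
            for (u, a), dist in sorted(delta.items()):
                if u != s:
                    continue
                v = sum(p * x[t] for t, p in dist.items())
                if v > best_v:
                    best_a, best_v = a, v
            new[s] = best_v
            decisions[(remaining, s)] = best_a
        x = new
    return x, decisions


# Hypothetical MDP: "fast" may reach the goal immediately, "slow" takes
# two steps but succeeds more often.
states = ["s0", "s1", "goal", "sink"]
delta = {
    ("s0", "fast"): {"goal": 0.5, "sink": 0.5},
    ("s0", "slow"): {"s1": 1.0},
    ("s1", "go"): {"goal": 0.9, "sink": 0.1},
    ("goal", "stay"): {"goal": 1.0},
    ("sink", "stay"): {"sink": 1.0},
}
values, decisions = bounded_reach(states, delta, {"goal"}, 2)
print(values["s0"], decisions[(1, "s0")], decisions[(2, "s0")])
```

Here the optimal action in `s0` depends on the number of steps left (`slow` with two steps remaining, `fast` with one), which is why memoryless strategies are not sufficient in general.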

Example 2. We return to the MDP $\mathcal{M}$ from Fig. 1 and synthesise a strategy
satisfying the property $\mathtt{P}_{\geq 0.4}[\mathtt{F}\, goal_1]$. To do so, we compute $Pr^{max}_{\mathcal{M}}(\mathtt{F}\, goal_1)$,
which equals 0.5. This is achieved by the memoryless deterministic strategy that
picks east in $s_0$, south in $s_1$ and east in $s_4$ (there is no choice to make in states
$s_2$, $s_3$, and any action can be taken in $s_5$, since $goal_1$ has already been reached).

Next, we consider label $goal_2$ and a numerical query $\mathtt{P}_{max=?}[\mathtt{F}^{\leq k}\, goal_2]$ with
a step-bounded reachability objective. We find that $Pr^{max}_{\mathcal{M}}(\mathtt{F}^{\leq k}\, goal_2)$ is 0.8,
0.96 and 0.99 for $k$ = 1, 2 and 3, respectively. Taking $k$ = 3 as an example, the
optimal strategy is deterministic, but finite-memory. For example, if we arrive at
state $s_4$ after 1 step, action east is optimal, since it reaches $goal_2$ with probability
0.9. If, on the other hand, we arrive in $s_4$ after 2 steps, it is better to take west,
since it would be impossible to reach $goal_2$ within $k - 2 = 1$ steps.

3.4 Strategy Synthesis for Probabilistic LTL 

To synthesise an optimal strategy of MDP $\mathcal{M}$ for an LTL formula $\psi$, we reduce
the problem to the simpler case of a reachability property on the product of
$\mathcal{M}$ and an $\omega$-automaton representing $\psi$. Here, we describe the approach of [2],
which uses deterministic Rabin automata (DRAs) and computes the probability
of reaching accepting end components. Since the minimum probability of an LTL
formula can be expressed as the maximum probability of a negated formula:

\[
Pr^{min}_{\mathcal{M},s}(\psi) = 1 - Pr^{max}_{\mathcal{M},s}(\neg\psi)
\]

we only need to consider the computation of maximum probabilities.

A DRA $\mathcal{A}$ with alphabet $\alpha$ represents a set of infinite words $\mathcal{L}(\mathcal{A}) \subseteq \alpha^{\omega}$. For
any LTL formula $\psi$ using atomic propositions from $AP$, we can construct
a DRA $\mathcal{A}_{\psi}$ with alphabet $2^{AP}$ that represents it, i.e., such that an infinite path
$\pi = s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \ldots$ of $\mathcal{M}$ satisfies $\psi$ if and only if $Lab(s_0) Lab(s_1) Lab(s_2) \ldots$ is
in $\mathcal{L}(\mathcal{A}_{\psi})$. We then proceed by building the (synchronous) product $\mathcal{M} \otimes \mathcal{A}_{\psi}$ of
$\mathcal{M}$ and $\mathcal{A}_{\psi}$. The product is an MDP with state space $S \times Q$, where $Q$ is the set
of states of the DRA. We then have:

\[
Pr^{max}_{\mathcal{M}}(\psi) = Pr^{max}_{\mathcal{M} \otimes \mathcal{A}_{\psi}}(\mathtt{F}\, acc)
\]

where $acc$ is an atomic proposition labelling accepting end components of $\mathcal{M} \otimes \mathcal{A}_{\psi}$.
An end component is a strongly connected sub-MDP of $\mathcal{M}$, and whether it is
accepting is dictated by the acceptance condition of $\mathcal{A}_{\psi}$. Computing $Pr^{max}_{\mathcal{M}}(\psi)$
thus reduces to identifying the set of all end components and
calculating the maximum probability of reaching the accepting ones.
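
The transition structure of the synchronous product can be sketched as below. This Python fragment is only an illustration under several assumptions: the automaton is given directly as a deterministic transition table (it is not produced from an LTL formula), the Rabin acceptance condition and end-component analysis are omitted, and the automaton is taken to read the label of the state being entered, which is one common convention. The MDP, the two-state automaton and the name `product` are hypothetical.

```python
def product(delta, labels, aut_delta, q0, s0):
    """Reachable part of the product of an MDP with a deterministic
    automaton, where aut_delta[(q, frozenset_of_labels)] gives q'."""
    init = (s0, aut_delta[(q0, frozenset(labels[s0]))])
    prod, todo, seen = {}, [init], {init}
    while todo:
        s, q = todo.pop()
        for (u, a), dist in delta.items():
            if u != s:
                continue
            succ = {}
            for t, p in dist.items():
                # The automaton moves on the label of the successor state.
                q2 = aut_delta[(q, frozenset(labels[t]))]
                succ[(t, q2)] = p
                if (t, q2) not in seen:
                    seen.add((t, q2))
                    todo.append((t, q2))
            prod[((s, q), a)] = succ
    return init, prod


# Hypothetical MDP and labelling.
delta = {
    ("s0", "east"): {"s1": 1.0},
    ("s0", "south"): {"s1": 0.5, "s2": 0.5},
    ("s1", "stay"): {"s1": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}
labels = {"s0": set(), "s1": {"goal"}, "s2": {"hazard"}}

# Hypothetical automaton remembering whether "hazard" has been seen.
aut_delta = {}
for ls in {frozenset(l) for l in labels.values()}:
    aut_delta[("ok", ls)] = "bad" if "hazard" in ls else "ok"
    aut_delta[("bad", ls)] = "bad"

init, prod = product(delta, labels, aut_delta, "ok", "s0")
print(init, len({sq for (sq, _) in prod}))
```

The result is itself an MDP over pairs `(s, q)`, so the reachability methods of Sec. 3.3 can be applied to it once the accepting end components have been identified.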
To build an optimal strategy maximising the probability of an LTL formula,
we need to consider finite-memory deterministic strategies. An optimal strategy
of this class is constructed in two steps. First, we find a memoryless deterministic
strategy for the product $\mathcal{M} \otimes \mathcal{A}_{\psi}$, which maximises the probability of reaching
accepting end components (and then stays in those end components, visiting
each state infinitely often). Then, we convert this to a finite-memory strategy,
with one mode for each state $q \in Q$ of the DRA $\mathcal{A}_{\psi}$.

Example 3. Again, using the running example (Fig. 1), we synthesise a strategy
for the LTL property $\mathtt{P}_{\geq 0.05}[(\mathtt{G}\, \neg hazard) \wedge (\mathtt{G}\,\mathtt{F}\, goal_1)]$, which aims to both
avoid the hazard-labelled state and visit the $goal_1$ state infinitely often. The
maximum probability, from the initial state, is 0.1. In fact, for this example, a
memoryless strategy suffices for optimality: we choose south in state $s_0$, which
leads to state $s_4$ with probability 0.1. We then remain in states $s_4$ and $s_5$ indefinitely
by choosing actions east and west, respectively.

3.5 Strategy Synthesis for Reward Properties 

The techniques required to perform strategy synthesis for expected reward properties
$\mathtt{R}^{r}_{\bowtie x}[\rho]$ are, in fact, quite similar to those required for the probabilistic
reachability properties, described in Sec. 3.3. For the case where $\rho = \mathtt{F}\, b$, techniques
similar to those for $\mathtt{P}_{\bowtie p}[\mathtt{F}\, b]$ are used: first, a graph-based analysis of
the model (in this case, to identify states of the MDP from which the expected
reward is infinite), and then methods such as value iteration or linear programming.
The resulting optimal strategy is again memoryless and deterministic.

For the case $\rho = \mathtt{C}$, where rewards are accumulated indefinitely, we need
to identify end components containing non-zero rewards, since these can result
in the expected reward being infinite. Subsequently, computation is similar to
the case of $\rho = \mathtt{F}\, b$ above. For step-bounded properties $\rho = \mathtt{C}^{\leq k}$, the situation
is similar to probabilities for step-bounded reachability; optimal strategies are
deterministic, but may need finite memory, and optimal expected reward values
can be computed recursively in $k$ steps.
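
For the case $\rho = \mathtt{F}\, b$ with minimisation, the numerical phase can be sketched as value iteration over expected accumulated reward. The Python code below is an illustration only: it skips the graph-based analysis by assuming a hypothetical MDP in which the target is reached with probability 1 under every strategy (so no value is infinite), and the names `min_expected_reward` and `moves` are chosen for this example.

```python
def min_expected_reward(states, delta, reward, target, eps=1e-10):
    """Minimum expected reward accumulated before reaching `target`,
    with an action attaining the minimum in each non-target state.
    Assumes the target is reached with probability 1 from every state."""
    x = {s: 0.0 for s in states}
    while True:
        new, choice = {}, {}
        for s in states:
            if s in target:
                new[s] = 0.0
                continue
            best = None
            for (u, a), dist in sorted(delta.items()):
                if u != s:
                    continue
                v = reward[(u, a)] + sum(p * x[t] for t, p in dist.items())
                if best is None or v < best:
                    best, choice[s] = v, a
            new[s] = best
        if max(abs(new[s] - x[s]) for s in states) < eps:
            return new, choice
        x = new


# Hypothetical MDP: "direct" succeeds with probability 0.25 per attempt
# (4 moves on average), whereas "detour" always takes exactly 2 moves.
states = ["s0", "s1", "goal"]
delta = {
    ("s0", "direct"): {"goal": 0.25, "s0": 0.75},
    ("s0", "detour"): {"s1": 1.0},
    ("s1", "go"): {"goal": 1.0},
    ("goal", "stay"): {"goal": 1.0},
}
moves = {key: 1.0 for key in delta}  # every action costs one move
values, choice = min_expected_reward(states, delta, moves, {"goal"})
print(values["s0"], choice["s0"])
```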

Example 4. We synthesise an optimal strategy for minimising the number of
moves that the robot makes (i.e., the number of actions taken in the MDP)
before reaching a $goal_2$ state. We use a reward structure moves that maps all
state-action pairs to 1, and a numerical query $\mathtt{R}^{moves}_{min=?}[\mathtt{F}\, goal_2]$. The
optimal value is achieved by the memoryless deterministic strategy that chooses
south, east, west and north in states $s_0$, $s_1$, $s_4$ and $s_5$, respectively.

4 Multi-objective Strategy Synthesis 

In this section, we describe multi-objective strategy synthesis for MDPs, which
generates a strategy $\sigma$ that simultaneously satisfies multiple properties of the
kind discussed in the previous section. We first describe the case for LTL properties
and then summarise some extensions.

Definition 8 (Multi-objective LTL). A multi-objective LTL property is a
conjunction $\phi = \mathtt{P}_{\bowtie_1 p_1}[\psi_1] \wedge \ldots \wedge \mathtt{P}_{\bowtie_n p_n}[\psi_n]$ of probabilistic LTL properties.
For MDP $\mathcal{M}$ and strategy $\sigma$, $\mathcal{M}, \sigma \models \phi$ if $\mathcal{M}, \sigma \models \mathtt{P}_{\bowtie_i p_i}[\psi_i]$ for all $1 \leq i \leq n$.


An algorithm for multi-objective LTL strategy synthesis was given in [20],
although here we describe an adapted version, based on [22], using deterministic
Rabin automata. The overall approach is similar to standard (single-objective)
LTL strategy synthesis in that it constructs a product automaton and reduces
the problem to (multi-objective) reachability.

First, we ensure that all $n$ properties $\mathtt{P}_{\bowtie_i p_i}[\psi_i]$ contain only lower probability
bounds $\bowtie_i\, \in \{\geq, >\}$, by negating LTL formulae as required (e.g., replacing
$\mathtt{P}_{< p}[\psi]$ with $\mathtt{P}_{> 1-p}[\neg\psi]$). Next, we build a DRA $\mathcal{A}_{\psi_i}$ for each LTL formula $\psi_i$,
and construct the product MDP $\mathcal{M}' = \mathcal{M} \otimes \mathcal{A}_{\psi_1} \otimes \cdots \otimes \mathcal{A}_{\psi_n}$. We then consider
each combination $X \subseteq \{1, \ldots, n\}$ of objectives and find the end components of
$\mathcal{M}'$ that are accepting for all DRAs $\{\mathcal{A}_{\psi_i} \mid i \in X\}$. We create a special sink state
for $X$ in $\mathcal{M}'$ and add transitions from states in the end components to the sink.

The problem then reduces to a multi-objective problem on $\mathcal{M}'$ for $n$ reachability
properties $\mathtt{P}_{\bowtie_1 p_1}[\mathtt{F}\, acc_1], \ldots, \mathtt{P}_{\bowtie_n p_n}[\mathtt{F}\, acc_n]$, where $acc_i$ represents the
union of, for each set $X$ containing $i$, the sink states for $X$. This can be done by
solving a linear programming (LP) problem [20].

Optimal strategies for multi-objective LTL may be finite-memory and randomised.
A strategy can be constructed directly from the solution of the LP
problem. As for LTL objectives (in Sec. 3.4), we obtain a memoryless strategy
for the product MDP and then convert it to a finite-memory one on $\mathcal{M}$.


We now summarise several useful extensions and improvements. 


(i) Boolean combinations of LTL objectives (rather than conjunctions, as in 
Defn.[ 8 | can be handled via a translation to disjunctive normal form EOEa. 

(ii) expected reward objectives can also be supported, in addition to LTL prop¬ 
erties. The LP-based approach sketched above has beeen extended [22] to 
include reward objectives of the form R^^C]. An alternative approach, 
based on value iteration, rather than LP [23|, allows the addition of step- 
bounded reward objectives R^^C^ ] (and also provides significant gains in 
efficiency for both classes of properties). 

(iii) numerical multi-objective queries generalise the numerical queries explained
in Defn. [T] For example, rather than synthesising a strategy satisfying
property $P_{\bowtie_1 p_1}[\psi_1] \wedge P_{\bowtie_2 p_2}[\psi_2]$, we can instead
synthesise a strategy that maximises the probability of $\psi_1$, whilst
simultaneously satisfying $P_{\bowtie_2 p_2}[\psi_2]$. The LP-based methods mentioned
above are easily extended to handle numerical queries by adding an objective
function to the LP problem.

(iv) Pareto queries [23] produce a Pareto curve (or an approximation of it)
illustrating the trade-off between multiple objectives. For example, if we want
to maximise the probabilities of two LTL formulae $\psi_1$ and $\psi_2$, the Pareto
curve comprises points $(p_1, p_2)$ such that there is a strategy $\sigma$ with
$Pr^{\sigma}_{\mathcal{M}}(\psi_1) \geq p_1$ and $Pr^{\sigma}_{\mathcal{M}}(\psi_2) \geq p_2$,
but, if either bound $p_1$ or $p_2$ is increased, no strategy exists without
decreasing the other bound.

Fig. 3. Pareto curve (dashed line) for maximisation of the probabilities of LTL formulae
$\psi_1 = G\,\neg\mathit{hazard}$ and $\psi_2 = G\,F\,\mathit{goal}_1$ (see Ex. 5).


We refer the reader to the references given above for precise details of the 
algorithms and any restrictions or assumptions that may apply. 
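To make the dominance condition behind a Pareto query concrete, the following Python sketch filters a finite set of candidate points $(p_1, p_2)$ down to those that are not dominated when both coordinates are maximised. It is a simplification under stated assumptions: the function name `pareto_front` and the sample values are hypothetical, and the filter ignores randomised strategies, which the text notes may be needed for optimality, so it does not reproduce the algorithms of [23].

```python
def pareto_front(points):
    """Keep the points not dominated by any other point (maximising both coordinates)."""

    def dominates(a, b):
        # a dominates b if a is at least as good in both coordinates and differs from b.
        return a[0] >= b[0] and a[1] >= b[1] and a != b

    return sorted(p for p in points if not any(dominates(q, p) for q in points))


# Hypothetical probability pairs (p1, p2); these are not the values from Fig. 3.
candidates = [(0.1, 0.5), (0.4, 0.4), (0.3, 0.3), (0.7, 0.2), (0.7, 0.1)]
front = pareto_front(candidates)
```

Here `(0.3, 0.3)` is discarded because `(0.4, 0.4)` is better in both coordinates, and `(0.7, 0.1)` because `(0.7, 0.2)` matches it in the first coordinate and improves the second.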

Example 5. Previously, in Ex. 3, we synthesised a strategy for the LTL property
$P_{\geq 0.05}[(G\,\neg\mathit{hazard}) \wedge (G\,F\,\mathit{goal}_1)]$ and found that the
maximum achievable probability was 0.1. Let us now consider each conjunct of the
LTL formula as a separate objective and synthesise a strategy satisfying the
multi-objective LTL property
$P_{\geq 0.7}[G\,\neg\mathit{hazard}] \wedge P_{\geq 0.2}[G\,F\,\mathit{goal}_1]$. For convenience,
we will abbreviate the objectives to $\psi_1 = G\,\neg\mathit{hazard}$ and
$\psi_2 = G\,F\,\mathit{goal}_1$.

Following the procedure outlined at the start of this section, we find that
there is a strategy satisfying $P_{\geq 0.7}[\psi_1] \wedge P_{\geq 0.2}[\psi_2]$. To give an
example of one such strategy, we consider a numerical multi-objective query that
maximises the probability of satisfying $\psi_2$ whilst satisfying
$P_{\geq 0.7}[\psi_1]$. The optimal value (maximum probability for $\psi_2$) is
approximately 0.2278, which is obtained by a randomised strategy that, in state
$s_0$, picks east with probability approximately 0.3226 and south with
probability approximately 0.6774.

Finally, we also show, in Fig. 3, the Pareto curve obtained when maximising
the probabilities of both $\psi_1$ and $\psi_2$. The grey shaded area shows all points
$(x, y)$ for which there is a strategy satisfying
$P_{\geq x}[\psi_1] \wedge P_{\geq y}[\psi_2]$. Points along the top edge of this region,
shown as a dashed line in the figure, form the Pareto curve. We also mark, as
black circles, points $(Pr^{\sigma}_{\mathcal{M}}(\psi_1), Pr^{\sigma}_{\mathcal{M}}(\psi_2))$ for
specific deterministic strategies of $\mathcal{M}$. The leftmost circle is the strategy
described in the first part of Ex. 2 and the rightmost one is the strategy from
Ex. 3.


5 Controller Synthesis with Stochastic Games 

So far, we have assumed that the nondeterministic choices in the model represent 
the choices available to a single entity, such as a controller. In many situations, it 
is important to consider decisions being made by multiple entities, possibly with
conflicting objectives. An example is the classic formulation of the controller
synthesis problem, in which a controller makes decisions about how to control,
for example, a manufacturing plant, and must respond to nondeterministic
behaviour occurring in the environment of the plant.
It is natural to model and analyse such systems using game-theoretic methods,
which are designed precisely to reason about the strategic decisions of
competing agents. From a modelling point of view, we generalise MDPs to
stochastic games, in which nondeterministic choices are resolved by multiple
players and the resulting behaviour is probabilistic (MDPs can thus be seen as
1-player stochastic games). We restrict our attention here to turn-based (as
opposed to concurrent) stochastic games, in which a single player is responsible
for the nondeterministic choices available in each state. In line with the rest
of the paper, we assume finite-state models and total information.

Definition 9 (SMG). A (turn-based) stochastic multi-player game (SMG) is
a tuple $\mathcal{G} = (\Pi, S, (S_i)_{i \in \Pi}, \bar{s}, A, \delta_{\mathcal{G}}, Lab)$, where
$S$, $\bar{s}$, $A$, and $Lab$ are as for an MDP, in Defn. [7J $\Pi$ is a finite set of
players and $(S_i)_{i \in \Pi}$ is a partition of $S$.

An SMG $\mathcal{G}$ evolves in a similar way to an MDP, except that the
nondeterministic choice in each state $s$ is resolved by the player that controls
that state (the player $i$ for which $s \in S_i$). Like MDPs, we reason about SMGs
using strategies, but these are defined separately for each player: a strategy
$\sigma_i$ for player $i$ is a function mapping finite paths ending in a state from
$S_i$ to a distribution over actions $A$. Given strategies $\sigma_1, \ldots, \sigma_k$ for
multiple players from $\Pi$, we can combine them into a single strategy
$\sigma = \sigma_1, \ldots, \sigma_k$. If a strategy $\sigma$ comprises strategies for all players
of the game (sometimes called a strategy profile), we can construct, like for an
MDP, a probability space $Pr^{\sigma}_{\mathcal{G}}$ over the infinite paths of $\mathcal{G}$.

For strategy synthesis on SMGs, we generate strategies either for an individual
player, or for a coalition $C \subseteq \Pi$ of players. We extend the definition of
properties given in Defn. 4 in the style of the logic rPATL [14].

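Definition 10 (Multi-player strategy synthesis). For a property $P_{\bowtie p}[\psi]$
or $R^{r}_{\bowtie x}[\rho]$ and a coalition $C \subseteq \Pi$ of players, (zero-sum)
multi-player strategy synthesis is expressed by a query
$\langle\langle C \rangle\rangle P_{\bowtie p}[\psi]$ or $\langle\langle C \rangle\rangle R^{r}_{\bowtie x}[\rho]$. For
example, $\langle\langle C \rangle\rangle P_{\bowtie p}[\psi]$ asks “does there exist a strategy
$\sigma_1$ for the players in $C$ such that, for all strategies $\sigma_2$ for the players
$\Pi \setminus C$, we have $Pr^{\sigma_1,\sigma_2}_{\mathcal{G}}(\psi) \bowtie p$?”.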

Intuitively, if $\langle\langle C \rangle\rangle P_{\bowtie p}[\psi]$ is true, then the players in $C$
can collectively guarantee that $P_{\bowtie p}[\psi]$ holds, regardless of what the
other players do. Like for the other classes of strategy synthesis described in
this paper, we can also use numerical queries for stochastic multi-player games.
We write, for example, $\langle\langle C \rangle\rangle P_{\max=?}[\psi]$ to denote the maximum
probability of $\psi$ that players in $C$ can guarantee, regardless of the actions
of the other players.

Multi-player strategy synthesis can be solved using rPATL model checking.
Details for the full logic rPATL (which allows nested operators and includes
additional reward operators, but omits $C^{\leq k}$ and $C$) are given in [14].
Basically, model checking reduces to the analysis of zero-sum properties on a
stochastic 2-player game in which player 1 corresponds to $C$ and player 2 to
$\Pi \setminus C$. For reachability properties (i.e.,
$\langle\langle C \rangle\rangle P_{\bowtie p}[F\,b]$), memoryless deterministic strategies suffice
and optimal values and strategies can be computed either with value iteration or
strategy iteration mm. For an LTL property $\psi$, we again reduce the problem to a
simpler one on the product of the game $\mathcal{G}$ and a deterministic automaton
representing $\psi$. Unlike MDPs, there is no notion of end components. Instead, we
can either use the strategy improvement algorithm of m to directly compute
probabilities for Rabin objectives, or convert the DRA to a deterministic parity
automaton and compute probabilities for parity objectives [12].

Fig. 4. Left: A stochastic 2-player game modelling an unknown probability interval.
Right: The maximum value with which ctrl can guarantee reaching $\mathit{goal}_1$ for
probability interval $[p, q] = [0.5-\Delta, 0.5+\Delta]$. See Ex. 6 for details.
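The value iteration mentioned above for reachability in a zero-sum stochastic 2-player game can be illustrated with a short Python sketch. It is a minimal version under stated assumptions: the function `game_reach_values`, its data layout and the toy game are all hypothetical (the game is not the one shown in Fig. 4), and the fixed iteration cap and stopping threshold are simplistic choices rather than the criteria used by any particular tool.

```python
def game_reach_values(states, owner, trans, goal, iterations=1000, eps=1e-10):
    """Approximate the probability of reaching `goal` that player 1 can guarantee.

    owner[s] is 1 (maximising player) or 2 (minimising player).
    trans[s] maps each action to a list of (probability, successor) pairs.
    States without transitions are treated as absorbing.
    """
    v = {s: 1.0 if s in goal else 0.0 for s in states}
    for _ in range(iterations):
        new = {}
        for s in states:
            if s in goal:
                new[s] = 1.0
            elif not trans.get(s):
                new[s] = 0.0
            else:
                # Expected value of each available action under the current estimate.
                vals = [sum(p * v[t] for p, t in dist) for dist in trans[s].values()]
                new[s] = max(vals) if owner[s] == 1 else min(vals)
        diff = max(abs(new[s] - v[s]) for s in states)
        v = new
        if diff < eps:
            break
    return v


# Hypothetical toy game: player 1 owns "c", player 2 owns "e".
states = ["c", "e", "goal", "fail"]
owner = {"c": 1, "e": 2, "goal": 1, "fail": 1}
trans = {
    "c": {"a": [(1.0, "e")], "b": [(0.3, "goal"), (0.7, "fail")]},
    "e": {"low": [(0.4, "goal"), (0.6, "fail")],
          "high": [(0.6, "goal"), (0.4, "fail")]},
}
values = game_reach_values(states, owner, trans, goal={"goal"})
```

In this toy game the minimising player picks `low` in state `e`, so player 1 can guarantee 0.4 from `c` by choosing action `a`, which is better than the 0.3 offered by action `b`.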

Example 6. To give an example of strategy synthesis using stochastic games, we
consider a simple extension of the MDP $\mathcal{M}$ from the running example (Fig. 1).
We assume that the existing choices in the MDP are made by a player ctrl, and
we add a second player env, which represents nondeterministic aspects of the
environment. Recall that transition probabilities in $\mathcal{M}$ model uncertain
outcomes of robot actions due to obstacles. For example, when action south is
taken in state $s_1$ of $\mathcal{M}$, the MDP only moves in the intended direction (to
$s_4$) with probability 0.5; otherwise, it moves to $s_2$. Let us now instead
assume that this probability (of going to $s_4$) can vary in some interval $[p, q]$.

Fig. 4 (left) shows how we can model this as a stochastic two-player game.$^2$
States controlled by player ctrl are, as before, shown as circles; those
controlled by player env are shown as squares. When action south is taken in
state $s_1$, we move to a new state $s_6$, in which player env can choose between
two actions, one that leads to $s_4$ with probability $p$ and one that does so with
probability $q$. Since players are allowed to select actions at random, this means
player env can effectively cause the transition to occur with any probability in
the range $[p, q]$.
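The last claim can be checked with a one-line calculation. Assume, purely for illustration, that env picks the action with probability $p$ with weight $\lambda$ and the action with probability $q$ with weight $1-\lambda$; the symbol $\lambda$ is introduced here and does not appear in the original text.

```latex
\[
  \Pr(s_6 \to s_4) \;=\; \lambda\, p + (1-\lambda)\, q ,
  \qquad \lambda \in [0,1],
\]
\[
  p \;\le\; \lambda\, p + (1-\lambda)\, q \;\le\; q
  \quad \text{whenever } p \le q .
\]
```

Taking $\lambda = 1$ gives $p$, taking $\lambda = 0$ gives $q$, and intermediate weights give every value in between.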

In Ex. 2, we computed the maximum probability of reaching $\mathit{goal}_1$ as 0.5.
Now, we will consider the maximum probability that ctrl can guarantee,
regardless of the choices made by env, i.e., the maximum probability of reaching
$\mathit{goal}_1$, if the probability of moving to state $s_4$ after action south can be
any value in $[p, q]$. This is done with the query
$\langle\langle \mathit{ctrl} \rangle\rangle P_{\max=?}[F\,\mathit{goal}_1]$. We compute this value for
various intervals $[p, q]$ centred around the original value of 0.5, i.e., we
let $p = 0.5-\Delta$, $q = 0.5+\Delta$ and vary $\Delta$. Fig. 4 (right) plots the result.
From inspection of the game, we can deduce that the plot corresponds to the
function $\min(0.5 - \Delta, 0.15 - 0.1\Delta)$. This means that, if $\Delta$ ^ ^ (i.e., if
$p$ ^ |), then it is better to switch to the strategy that picks south in state
$s_0$, rather than east.

2 This notion can be captured more cleanly by annotating transitions directly with
probability intervals 23) or with more general specifications of uncertainty [39].
Here, we just aim to give a simple illustration of using a stochastic 2-player game.

6 Challenges and Directions 

In this paper, we have given a brief overview of strategy synthesis for
probabilistic systems, pointing to some promising application areas, highlighting
the benefits that can be derived from existing work on probabilistic verification,
and summarising the algorithmic techniques required for a variety of useful
strategy synthesis methods. We invite the reader to consult the extended version
of this paper [SU] for further details.

As noted at the start, the current presentation makes a number of simplifying
assumptions. We conclude by reviewing some of the key challenges in the area
of strategy synthesis for probabilistic systems.

— Partial observability. In this paper, we assumed a complete information
setting, where the state of the model (and the states of its history) are fully
visible when a strategy chooses an action to take. In many situations, this is
unrealistic, which could lead to strategies being synthesised that are not
feasible in practice. Although fundamental decision problems are undecidable in
the context of partial observability [5], practical implementations have been
developed for a few cases mm and some tool support exists [3B]. Developing
efficient methods for useful problem classes is an important challenge.

— Robustness and uncertainty. In many potential applications of strategy
synthesis, such as the generation of controllers in embedded systems, it may
be difficult to formulate a precise model of the stochastic behaviour of the
system’s environment. Thus, developing appropriate models of uncertainty,
and corresponding methods to synthesise strategies that are robust in these
environments, is important. We gave a very simple illustration of uncertain
probabilistic behaviour in Sec. 5. Developing more sophisticated approaches
is an active area of research [2HI3T[ .

— Continuous time and space. In this paper, we focused on discrete-time
probabilistic models. Verification techniques have also been developed for models
that incorporate both nondeterminism and continuous notions of time,
including probabilistic timed automata [35], interactive Markov chains [55] and
Markov automata [33]. Similarly, progress is being made on verification
techniques for models with continuous state spaces, and hybrid models that mix
both discrete and continuous elements [481441 . Developing efficient strategy
synthesis techniques for such models will bring the benefits of the methods
discussed in this paper to a much wider range of application domains.